Self-organizing Data Mining

Today, there is an increased need to discover information - non-obvious, contextual and valuable for decision making - from large collections of data efficiently. This interactive and iterative process of various subtasks and decisions is called Knowledge Discovery from Data. The engine of Knowledge Discovery - where data is transformed into knowledge for decision making - is Data Mining. Many different data mining tools are available, and many papers have been published describing data mining techniques. We think that the priority for advanced data mining is to limit the involvement of users in the entire data mining process to the inclusion of well-known a priori knowledge exclusively, while making this process more automated and more objective. Most users' primary interest is in the model results proper; they have neither extensive knowledge of mathematical, cybernetic and statistical techniques nor sufficient time for dialog-driven modeling tools. Soft computing - Fuzzy Modeling, Neural Networks, Genetic Algorithms and other methods of automatic model generation - is a way to mine data by generating mathematical models from empirical data more or less automatically. In the past years there has been much publicity about the ability of Artificial Neural Networks to learn and to generalize, despite important problems with the design, development and application of Neural Networks.
In contrast to traditional Neural Networks, KnowledgeMiner employs principles of evolution - inheritance, mutation and selection - for generating a network structure systematically, enabling combined automatic model structure synthesis and model validation. Models are generated adaptively from data in the form of networks of active neurons, in an evolutionary fashion: populations of competing models of growing complexity are generated repeatedly, validated and selected until an optimally complex model - not too simple and not too complex - has been created. That is, a tree-like network is grown out of seed information (the input and output variables' data) by pairwise combination and survival-of-the-fittest selection, from a single simple individual (neuron) to a desired final, not overspecialized behavior (model). Neither the number of neurons and layers in the network nor the actual behavior of each created neuron is predefined; all of this is adjusted during the process of self-organization, which is why the approach is called self-organizing data mining.

The differences between traditional Neural Networks and this new approach are rooted in Statistical Learning Networks and induction. The first Statistical Learning Network algorithm of this new type, the Group Method of Data Handling (GMDH), was developed by A.G. Ivakhnenko in 1967. Considerable improvements were introduced in the 1970s and 1980s by versions of the Polynomial Network Training algorithm (PNETTR) by Barron and the Algorithm for Synthesis of Polynomial Networks (ASPN) by Elder, when Adaptive Learning Networks and GMDH were flowing together. Further enhancements of the GMDH algorithm have
been realized in KnowledgeMiner.

Why Data Mining is Needed

Models make it possible to support and improve decision making.
Mathematical modeling has therefore formed the core of almost all decision support systems. Models can be derived from existing theory (the theory-driven approach, or theoretical systems analysis) and/or from data (the data-driven approach, or experimental systems analysis).

a. Theory-driven approach

For complex ill-defined systems, such as economic, ecological, social and biological systems, we have insufficient a priori knowledge about the relevant theory of the system under research. Modeling based on a theory-driven approach is considerably affected by the fact that the modeler often has to know things about the system that are generally impossible to find out. This concerns uncertain a priori information with regard to the selection of the model structure (factors of influence and functional relations) as well as insufficient knowledge about interference factors (the actual interference factors and those factors of influence which cannot be measured). Accordingly, insufficient a priori information concerns precisely this required knowledge about the object under research.
In order to overcome these problems and to deal with ill-defined systems and, in particular, with insufficient a priori knowledge, ways must be found to reduce, with the help of emergent information engineering, the time- and resource-intensive model formation process required before initial task solving can start. Computer-aided design of mathematical models may soon prove highly valuable in bridging this gap.

b. Data-driven approach

Modern information technologies deliver a flood of data, and the question is how to leverage it. Commonly, statistically based principles are used for model formation, but they always require a priori knowledge about the structure of the mathematical model. In addition to the epistemological problems of commonly used statistical principles of model formation, several methodological problems may arise in conjunction with this insufficiency of a priori information. This indeterminacy of the starting position - marked by the subjectivity and incompleteness of the theoretical knowledge and an insufficient data basis - leads to several methodological problems as described in [Lemke/Müller, 1997]. Knowledge discovery from data, and specifically data mining techniques and tools, can assist humans in analyzing the mountains of data and in turning the information located in the data into successful decision making. Data mining comprises not just a single analytical technique but many methods and techniques, depending on the nature of the enquiry: data visualization, tree-based methods and methods of mathematical statistics, as well as knowledge extraction from data using self-organizing modeling.
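The inductive principle behind the data-driven approach - judging competing models of growing complexity by an external criterion, i.e. their error on data not used for fitting, instead of assuming the model structure a priori - can be sketched in a few lines. This is an illustrative example with synthetic data, not from the article; the interleaved split and the polynomial model family are arbitrary choices.

```python
import numpy as np

# Illustrative only (synthetic data, arbitrary polynomial family):
# choose model complexity by an external criterion - the error on data
# NOT used for fitting - instead of fixing the structure a priori.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = 1.5 * x**2 - 0.5 * x + rng.normal(0.0, 0.05, x.size)  # true structure: degree 2

train, valid = slice(0, None, 2), slice(1, None, 2)  # interleaved data split

def validation_error(degree):
    """Fit on the training part, judge on the fresh validation part."""
    coeffs = np.polyfit(x[train], y[train], degree)
    resid = y[valid] - np.polyval(coeffs, x[valid])
    return float(np.sqrt(np.mean(resid**2)))

errors = {d: validation_error(d) for d in range(1, 8)}
best = min(errors, key=errors.get)
print(best)  # the external criterion typically favors a low degree
```

Models that are too simple fail on both parts of the data, while models that are too complex fit the training part well but degrade on the validation part; the external criterion thus locates the "not too simple and not too complex" model without any dialog with the user.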
Data mining is an interactive and iterative process of numerous subtasks and decisions, such as data selection and pre-processing, the choice and application of data mining algorithms, and analysis of the extracted knowledge. Most important for a more sophisticated data mining application is to limit the involvement of users in the overall data mining process to the inclusion of existing a priori knowledge, while making the process more automated and more objective. Automatic model generation such as GMDH, Analog Complexing, and GMDH-based Fuzzy Rule Induction meets these demands and sometimes provides the only way to generate models of ill-defined problems.
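As an illustration of the GMDH idea described above - quadratic "active neurons" built from pairs of inputs, survival-of-the-fittest selection judged on a separate selection set, and layer-wise growth until the external criterion stops improving - a minimal, hypothetical implementation might look as follows. It is not KnowledgeMiner's code; all names, the neuron polynomial, and the parameters (layer width, maximum depth) are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

# Hypothetical minimal sketch of GMDH-style self-organization
# (not KnowledgeMiner's implementation).

def design(u, v):
    """Regressor matrix of a quadratic 'active neuron' z = f(u, v)."""
    return np.column_stack([np.ones_like(u), u, v, u * v, u * u, v * v])

def fit_neuron(u, v, y):
    """Least-squares fit of one neuron on the training part."""
    coef, *_ = np.linalg.lstsq(design(u, v), y, rcond=None)
    return coef

def gmdh(X_train, y_train, X_sel, y_sel, width=6, max_layers=5):
    """Grow layers of pairwise neurons; return the best external RMSE."""
    best_err = np.inf
    layer_tr, layer_se = X_train, X_sel
    for _ in range(max_layers):
        candidates = []
        for i, j in combinations(range(layer_tr.shape[1]), 2):
            coef = fit_neuron(layer_tr[:, i], layer_tr[:, j], y_train)
            pred = design(layer_se[:, i], layer_se[:, j]) @ coef
            err = float(np.sqrt(np.mean((pred - y_sel) ** 2)))
            candidates.append((err, coef, i, j))
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:width]       # survival of the fittest
        if survivors[0][0] >= best_err:      # no improvement: stop growing
            break
        best_err = survivors[0][0]
        # the survivors' outputs become the inputs of the next layer
        layer_tr = np.column_stack([design(layer_tr[:, i], layer_tr[:, j]) @ c
                                    for _, c, i, j in survivors])
        layer_se = np.column_stack([design(layer_se[:, i], layer_se[:, j]) @ c
                                    for _, c, i, j in survivors])
    return best_err

# Demo on a synthetic 'unknown' system y = x0*x1 + x2:
rng = np.random.default_rng(1)
X = rng.normal(size=(240, 4))
y = X[:, 0] * X[:, 1] + X[:, 2]
err = gmdh(X[:120], y[:120], X[120:], y[120:])
print(round(err, 3))
```

In the demo, no neuron of the first layer can describe the system alone, but the second layer can already combine a product neuron with a linear neuron and drive the selection error close to zero - the number of layers and the behavior of each neuron emerge from the data, not from the user.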
Contact:
knowledgeminer@iworld.to
Date Last Modified: 03/23/99